Abstract

This paper presents a new equating method for the nonequivalent groups with anchor test design: 
poststratification equating based on true anchor scores. The linear version of this method is 
shown to be equivalent, under certain conditions, to Levine observed score equating, in the same 
way that the linear version of poststratification equating is equivalent to Tucker equating. Some 
issues related to this result are discussed. 

Key words: nonequivalent groups with anchor test design, poststratification equating based on 
true anchor scores, Levine observed score equating 



The nonequivalent groups with anchor test (NEAT) design is commonly used for 
equating the scores on different test forms in many high-volume, high-stakes testing programs. 
Choosing an equating method is often a major concern for the statisticians who work on these 
programs. Among the equating methods that do not require the strong assumptions of item 
response theory, the most commonly used are these: 1 

• Chained linear equating 

• Chained equipercentile equating 

• Tucker equating 

• Levine observed-score equating 

• Levine true-score equating 

• Poststratification equipercentile equating (PSEE), also called frequency estimation 
equipercentile equating 

Among these, only the chained equipercentile method and the poststratification 
equipercentile method allow for the possibility of a nonlinear equating relationship. This 
limitation of the other methods has important practical consequences, because when test forms 
differ in difficulty, the equating relationship is often curvilinear. 

For several years, psychometricians have been attempting to develop a curvilinear 
analogue to the Levine method. These attempts include 

• a method in which the equated score determined by PSEE is modified by adding the 
difference between the equated scores determined by two linear equating methods: 
the Levine observed-score method and the Tucker method (von Davier, Fournier-Zajac, 
& Holland, 2006), 

• two methods in which PSEE is modified by transforming the score distributions with 
the mean-preserving linear transformation (Chen & Holland, 2009; Wang & Brennan, 
2007), and 

• chained true-score equipercentile equating, a chained equipercentile equating of 
estimated true-score distributions (Chen & Holland, 2008). 

The first three of the four methods described above use some kind of mathematical 
manipulation, either on the score distributions or on the equating function. They are not 
analogous to the Levine method in the way that PSEE is analogous to the Tucker method: the 
linear version of PSEE produces the same equating function as Tucker equating when the 
assumptions of Tucker equating are met (Braun & Holland, 1982). 

This paper defines a new equating method, poststratification equipercentile equating 
based on true anchor scores (PSEE_TA), and demonstrates that it has the following properties: 

1. When the assumptions of Levine equating are met, the linear equating from the 
cumulative distributions produced by PSEE_TA is Levine observed score equating. 

2. By applying a mean-preserving linear transformation to the joint distribution of 
observed test scores and anchor scores, one can estimate the joint distribution of 
observed test scores and true anchor scores. The PSEE of these modified distributions 
will then approximate the PSEE_TA. The computation is identical to that of the method 
previously referred to as curvilinear Levine observed score equating (Chen & 
Holland, 2009). 


Equating Methods 

In this paper, X and Y will represent scores on the test forms to be equated, with 
population P taking Form X and population Q taking Form Y, while A will represent the score 
on an anchor test taken by both populations P and Q. The equating relationship between X and Y 
is to be determined for the synthetic population S, defined as a weighted mixture of populations 
P and Q, represented in the proportions w, for population P, and (1 - w), for population Q. The 
symbols $T_X$ and $E_X$ will represent the true-score and error components of score X, with similar 
notation for score Y and anchor score A. 

The general form of a linear equating from X to Y in population S is the following: 

$$y = \mu_S(Y) + \frac{\sigma_S(Y)}{\sigma_S(X)}\,[x - \mu_S(X)], \tag{1}$$

where $\mu_S$ and $\sigma_S$ denote the mean and standard deviation in population S. 
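As a minimal sketch of this linear form, the Python function below maps a Form X score to the Form Y scale by matching synthetic-population means and standard deviations. The function name, argument names, and numeric moments are illustrative assumptions, not values from this paper.

```python
def linear_equate(x, mu_x, sd_x, mu_y, sd_y):
    """Map score x on Form X to the Form Y scale by matching means and SDs in S."""
    if sd_x <= 0:
        raise ValueError("sd_x must be positive")
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Hypothetical synthetic-population moments, for illustration only.
equated = linear_equate(30.0, mu_x=25.0, sd_x=5.0, mu_y=27.0, sd_y=6.0)
print(equated)
```

The equating methods discussed in the following sections differ only in how the four moments passed to such a function are estimated.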

Tucker Equating 

Tucker equating consists of Equation 1 with estimates of the means and standard 
deviations that are based on the following assumptions (see Kolen & Brennan, 2004, p. 106): 

1. The regression of X on A and the regression of Y on A are population-invariant. That 
is, they are the same in population S, where they cannot be directly observed, as in 
populations P and Q, where they can be observed. 

2. The conditional variance of X given A and the conditional variance of Y given A are 
population-invariant. 
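Under these two assumptions, the moments of X in the synthetic population follow from the regression of X on A observed in population P. The sketch below uses the standard textbook form of the Tucker formulas (as in Kolen & Brennan, 2004); these formulas are not written out in this paper, and the function name, argument names, and example numbers are hypothetical.

```python
def tucker_synthetic_moments(mean_x_p, var_x_p, cov_xa_p,
                             mean_a_p, var_a_p, mean_a_q, var_a_q, w):
    """Mean and variance of X in S = w*P + (1-w)*Q when X is observed only in P."""
    gamma = cov_xa_p / var_a_p          # slope of the regression of X on A in P
    d_mean = mean_a_p - mean_a_q        # anchor mean difference between P and Q
    d_var = var_a_p - var_a_q           # anchor variance difference between P and Q
    mean_x_s = mean_x_p - (1 - w) * gamma * d_mean
    var_x_s = (var_x_p
               - (1 - w) * gamma ** 2 * d_var
               + w * (1 - w) * gamma ** 2 * d_mean ** 2)
    return mean_x_s, var_x_s


# Hypothetical moments, for illustration only.
mean_s, var_s = tucker_synthetic_moments(
    mean_x_p=50.0, var_x_p=100.0, cov_xa_p=40.0,
    mean_a_p=20.0, var_a_p=25.0, mean_a_q=18.0, var_a_q=25.0, w=0.5)
print(mean_s, var_s)
```

With w = 1 the synthetic population is P itself, so the function returns the observed moments of X unchanged. The moments of Y are obtained the same way with the roles of P and Q exchanged.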

Levine Observed Score Equating 

Levine observed score equating also has the form of Equation 1, but with estimates of the 
means and standard deviations that are based on different assumptions (see Kolen & Brennan, 
2004, p. 110): 

1. The regression of $T_X$ on $T_A$ and the regression of $T_Y$ on $T_A$ are population-invariant. 

2. The variances of the error components of X and of Y are population-invariant. 

3. True scores on the anchor and on the tests to be equated are perfectly correlated. 

4. The variance of the error component of A is population-invariant. 

Generalized Levine Observed Score Equating 

Generalized Levine observed score equating is defined as a linear equating of X to Y in 
population S, based on Equation 1, with estimates of the means and standard deviations that are 
based on the following assumptions: 

• Assumption 1: The regression of X on $T_A$ and the regression of Y on $T_A$ are 
population-invariant. 

• Assumption 2: The conditional variance of X given $T_A$ and the conditional variance 
of Y given $T_A$ are population-invariant. 

Under the usual classical test theory assumption that the errors in X and A are 
uncorrelated with their true scores and with each other, the slope of the regression of X on $T_A$ is 

$$\frac{\mathrm{cov}(X, T_A)}{\mathrm{var}(T_A)} = \frac{\mathrm{cov}(T_X, T_A) + \mathrm{cov}(E_X, T_A)}{\mathrm{var}(T_A)} = \frac{\mathrm{cov}(T_X, T_A)}{\mathrm{var}(T_A)}, \tag{2}$$

and the conditional variance of X given $T_A$ is 

$$\begin{aligned}
\mathrm{var}(X \mid T_A) &= \mathrm{var}(X)\,[1 - \rho^2(X, T_A)] \\
&= \mathrm{var}(X)\left[1 - \frac{\mathrm{cov}^2(X, T_A)}{\mathrm{var}(X)\,\mathrm{var}(T_A)}\right] \\
&= \mathrm{var}(X) - \frac{\mathrm{cov}^2(X, T_A)}{\mathrm{var}(T_A)} \\
&= \mathrm{var}(X) - \frac{\mathrm{cov}^2(T_X, T_A)}{\mathrm{var}(T_A)} \\
&= \mathrm{var}(X) - \rho^2(T_X, T_A)\,\mathrm{var}(T_X).
\end{aligned} \tag{3}$$



The same results occur for Y. 
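Equations 2 and 3 can be checked exactly on a small artificial population. The sketch below is an illustration only: the population (every combination of an anchor true score, a test-specific true-score component, and an error being equally likely) is an assumption chosen so that the error is exactly uncorrelated with the true scores, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Toy population: rows are (T_A, T_X, E_X) with T_X = 2*T_A + d.
rows = [(t, 2 * t + d, e)
        for t, d, e in product((0, 1, 2), (-1, 1), (-1, 0, 1))]


def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))


def cov(u, v):
    """Population covariance; cov(u, u) is the population variance."""
    mu, mv = mean(u), mean(v)
    return mean((a - mu) * (b - mv) for a, b in zip(u, v))


t_a = [r[0] for r in rows]
t_x = [r[1] for r in rows]
x = [r[1] + r[2] for r in rows]              # observed score X = T_X + E_X

# Equation 2: the slope of X on T_A equals the slope of T_X on T_A.
slope = cov(x, t_a) / cov(t_a, t_a)
assert slope == cov(t_x, t_a) / cov(t_a, t_a)

# Equation 3: residual variance of X about its linear regression on T_A.
rho_sq = cov(t_x, t_a) ** 2 / (cov(t_x, t_x) * cov(t_a, t_a))
cond_var = cov(x, x) - cov(x, t_a) ** 2 / cov(t_a, t_a)
assert cond_var == cov(x, x) - rho_sq * cov(t_x, t_x)
```

In this toy population the regression is linear with constant conditional variance, so the residual variance computed in the last step coincides with the conditional variance of X given $T_A$.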

Levine observed score equating then becomes a special case of generalized Levine 
observed score equating, with two additional assumptions: 

• Assumption 3: Both $T_X$ and $T_Y$ are correlated perfectly with $T_A$. 

• Assumption 4: The variance of the error component of A is population-invariant. 

Assumptions 3 and 4 are the assumptions identified above as Assumptions 3 and 4 of Levine 
equating. 

Assumption 1 of Levine equating is identical to Assumption 1 of generalized Levine 
equating, because the linear regression of $T_X$ on $T_A$ is the same as the linear regression of X on 
$T_A$. (See Equation 2.) 

Assumption 2 of Levine equating says that the error variances of test scores X and Y are 
population-invariant. If Assumption 3 is true, then this assumption follows from Assumption 
2 of generalized Levine equating. Note that if $\rho(T_X, T_A) = 1$, then, from Equation 3, the 
conditional variance of X given $T_A$ becomes simply $\mathrm{var}(X) - \mathrm{var}(T_X)$, which is the variance of 
the error component of X. Therefore, Assumption 2 of generalized Levine equating (population 
invariance of the conditional variance of X given $T_A$) becomes identical to Assumption 2 of 
Levine equating (population invariance of the variance of the error component of X). A similar 
result holds for Y and $T_Y$. 

Poststratification Equipercentile Equating 

This method, also known as frequency estimation equipercentile equating, is another 
classical equating method that applies to a NEAT design (see Angoff, 1971, pp. 581-582; 
Braun & Holland, 1982, pp. 21-23; Kolen & Brennan, 2004, pp. 136-139). This method does 
not constrain the equating relationship to be a linear function. Assumptions 1 and 2 of the Tucker 
method are replaced by the assumption that the conditional distributions of X and of Y, given A, 
are population-invariant. 

As before, the subscripts P and Q indicate the populations taking test forms X and Y; the 
subscript S indicates the synthetic population. F, G, and H represent cumulative distribution 
functions of scores X, Y, and A. In many cases, the distributions indicated by F and G will be 
conditional distributions, conditioning on the anchor score. For example, $F_P(x \mid a)$ will represent 
the conditional distribution function of X, given A = a, in population P. 

The distribution function of anchor score A in population S is 

$$H_S(a) = w H_P(a) + (1 - w) H_Q(a). \tag{4}$$

Then the distribution of X in population S is 

$$F_S(x) = w \int F_P(x \mid a)\, dH_P(a) + (1 - w) \int F_Q(x \mid a)\, dH_Q(a). \tag{5}$$

The assumption that the conditional distribution of X given A = a is population-invariant 
makes it possible to substitute the conditional distributions in population P for the corresponding 
distributions in population Q, leading to an estimate for the distribution of X in population S, 

$$\hat{F}_S(x) = w \int F_P(x \mid a)\, dH_P(a) + (1 - w) \int F_P(x \mid a)\, dH_Q(a) = \int F_P(x \mid a)\, dH_S(a), \tag{6}$$

where $H_S(a)$ is given by Equation 4. 

The assumption that the conditional distribution of Y given A = a is population-invariant 
yields a similar estimate $\hat{G}_S(y)$ for the distribution of Y in population S. The poststratification 
equipercentile equating of X to Y is the equipercentile equating from $\hat{F}_S(x)$ to $\hat{G}_S(y)$. 
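For discrete scores, the integrals in Equations 4 and 6 become sums over anchor score values. The following sketch computes the estimated distribution of X in the synthetic population in probability mass form; the function name, the dictionary layout, and the toy distributions are assumptions made for illustration.

```python
def poststratify(cond_x_given_a, h_p, h_q, w):
    """Estimate the pmf of X in S = w*P + (1-w)*Q by poststratifying on A.

    cond_x_given_a[a][x] : P(X = x | A = a) in population P (assumed invariant)
    h_p[a], h_q[a]       : pmf of the anchor score A in populations P and Q
    """
    # Equation 4 in pmf form: anchor distribution in the synthetic population.
    h_s = {a: w * h_p[a] + (1 - w) * h_q[a] for a in h_p}
    # Equation 6 in pmf form: mix the conditional distributions with weights h_s.
    f_s = {}
    for a, weight in h_s.items():
        for x, p in cond_x_given_a[a].items():
            f_s[x] = f_s.get(x, 0.0) + p * weight
    return f_s


# Toy example with two anchor scores and two test scores (hypothetical numbers).
cond = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
h_p = {0: 0.5, 1: 0.5}
h_q = {0: 0.25, 1: 0.75}
f_s = poststratify(cond, h_p, h_q, w=0.5)
print(f_s)
```

Cumulating the returned probabilities gives the estimated distribution function of X in S. The same function applied to the conditional distributions of Y given A in population Q gives the corresponding estimate for Y.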

Curvilinear Levine Observed Score Equating 

Curvilinear Levine observed score equating (Chen & Holland, 2009) requires a 
mean-preserving linear transformation of the bivariate distributions of test scores and 
anchor scores. The mean-preserving linear transformation with parameters $\lambda$ and $\nu$ 
transforms (X, A) to a new pair of random variables (X', A'), 

$$X' = \mu_X + \lambda (X - \mu_X), \qquad A' = \mu_A + \nu (A - \mu_A), \tag{7}$$

where $\mu_X$ and $\mu_A$ are the means of X and A, respectively, and $\lambda$ and $\nu$ are positive real numbers. 

The joint distribution of X' and A' has the same means as the joint distribution of X and A, 
as well as the same correlation coefficient, but the standard deviations of X and A are multiplied 
by the factors $\lambda$ and $\nu$. 
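The three properties just stated (unchanged means, unchanged correlation, rescaled standard deviations) can be verified numerically. This is a minimal sketch on made-up paired scores; the function names, the data, and the scale factors 0.9 and 1.1 are illustrative assumptions.

```python
from statistics import fmean, pstdev


def mean_preserving_transform(values, scale):
    """Rescale deviations from the mean by `scale`, leaving the mean unchanged."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    mu = fmean(values)
    return [mu + scale * (v - mu) for v in values]


def correlation(u, v):
    """Population correlation coefficient of two equally long score lists."""
    mu_u, mu_v = fmean(u), fmean(v)
    cov = fmean([(a - mu_u) * (b - mu_v) for a, b in zip(u, v)])
    return cov / (pstdev(u) * pstdev(v))


# Hypothetical paired test and anchor scores.
x = [10, 12, 15, 19, 24]
a = [3, 4, 4, 6, 8]
x2 = mean_preserving_transform(x, 0.9)   # plays the role of X'
a2 = mean_preserving_transform(a, 1.1)   # plays the role of A'

print(fmean(x2), pstdev(x2) / pstdev(x), correlation(x2, a2) - correlation(x, a))
```

Because both scale factors are positive, the correlation is unaffected, while each standard deviation is multiplied by its own factor.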

Chen and Holland (2009) noted that, in kernel equating (von Davier, Holland, & Thayer, 
2004), if the distributions are continuized with a very large bandwidth, poststratification 
equipercentile equating (PSEE) becomes nearly identical to Tucker equating. They then showed 
that applying the mean-preserving linear transformation, with an appropriate choice of $\lambda$ and $\nu$, to 
the joint distributions of (X, A) and of (Y, A) and then continuizing these joint distributions with a 
very large bandwidth will make PSEE nearly identical to Levine observed score equating. On the 
basis of that result, they defined curvilinear Levine observed score equating as the process of 
transforming the (X, A) and (Y, A) distributions with these values of $\lambda$ and $\nu$ and then using the 
transformed bivariate distributions to do poststratification equipercentile equating. 

Poststratification Equipercentile Equating Based on True Anchor Scores 

As the name suggests, this method (abbreviated PSEE_TA) is a variation of PSEE in which 
the anchor scores are replaced by their corresponding true scores, which is the same way that 
Levine observed score equating differs from Tucker equating. The conditional distributions of 
the tests X and Y, given the anchor true score, are assumed to be population-invariant. Using $\tau_a$ to 
represent a given true score on the anchor, one can estimate the cumulative distribution of X in 
population S as 

$$F_S(x; T_A) = \int F_P(x \mid \tau_a)\, dH_S(\tau_a), \tag{8}$$

where $F_P(x \mid \tau_a)$ is the conditional distribution function of X given $T_A = \tau_a$ in population P, and 
$H_S(\tau_a)$ is the distribution function of $T_A$ in population S. The cumulative distribution of Y in 
population S is estimated similarly by $G_S(y; T_A)$, and the equating is the equipercentile equating 
from $F_S(x; T_A)$ to $G_S(y; T_A)$. 

Results 


In this section it will be shown that, under certain conditions, the linear form of PSEE_TA 
is Levine observed score equating, and that the formula for curvilinear Levine observed score 
equating (Chen & Holland, 2009) can be used for PSEE_TA. 

First, two terms need to be defined. They are the conditional mean of X given the true 
anchor score, 

$$\mu_P(X \mid \tau_a) = \int x\, dF_P(x \mid \tau_a), \tag{9}$$

and the conditional variance of X given the true anchor score: 

$$\mathrm{var}_P(X \mid \tau_a) = \int [x - \mu_P(X \mid \tau_a)]^2\, dF_P(x \mid \tau_a). \tag{10}$$

The corresponding terms for Y are defined similarly. 
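When the conditional distribution of X is discrete, Equations 9 and 10 reduce to weighted sums. The sketch below evaluates both for a hypothetical family of conditional distributions, chosen so that the conditional mean is linear in the anchor true score and the conditional variance is constant; the function name and the distributions are assumptions for illustration.

```python
def conditional_moments(pmf):
    """Mean (Equation 9) and variance (Equation 10) of a pmf {score: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


# Hypothetical conditional distributions of X at anchor true scores 0, 1, and 2:
# the same three-point shape shifted by 2 score points per unit of true score.
cond = {tau: {2 * tau: 0.25, 2 * tau + 1: 0.5, 2 * tau + 2: 0.25}
        for tau in (0, 1, 2)}
moments = {tau: conditional_moments(pmf) for tau, pmf in cond.items()}

for tau, (m, v) in moments.items():
    print(tau, m, v)
```

Here the conditional mean equals 2 times the true score plus 1 at every level, and the conditional variance is the same at every level.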

Theorem 

If in population P, $\mu_P(X \mid \tau_a)$ is a linear function of $\tau_a$, and $\mathrm{var}_P(X \mid \tau_a)$ is constant on 
$T_A$, and if $\mu_Q(Y \mid \tau_a)$ and $\mathrm{var}_Q(Y \mid \tau_a)$ have the same properties in population Q, then the linear 
equating from $F_S(x; T_A)$ to $G_S(y; T_A)$ is generalized Levine observed score equating. 

The proof is given in Appendix A. 

Corollary 

Additionally, if both $T_X$ and $T_Y$ are correlated perfectly with $T_A$, and the variance of the 
error component of A is population-invariant, then the linear equating from $F_S(x; T_A)$ to 
$G_S(y; T_A)$ is Levine observed score equating. 

Theoretically, PSEE_TA can be regarded as the curvilinear analogue to Levine observed 
score equating. However, this equating has no practical value unless one can estimate the 
distributions of (X, $T_A$) and (Y, $T_A$). Some data models can produce the joint distribution of 
(X, $T_A$) or (Y, $T_A$), but the fitted models will sometimes fail statistical tests of fit in large samples. 
One approach is to use the mean-preserving linear transformation of Equation 7 to transform 
(X, A) and (Y, A) into (X', A') and (Y', A'), choosing the values of $\lambda_X$, $\nu_X$, $\lambda_Y$, and $\nu_Y$ that make the 
linear PSEE (i.e., Tucker equating) based on (X', A') and (Y', A') equal to the linear PSEE_TA (i.e., 
Levine observed score equating) based on the original datasets (X, A) and (Y, A). Hence, the 
curvilinear form of PSEE_TA based on the original datasets (X, A) and (Y, A) can be approximated 
with the curvilinear form of PSEE based on (X', A') and (Y', A'). 

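One way to choose the values of the $\lambda$s and $\nu$s is to set a linear equating from X to Y, 

$$y = \mu_S(Y') + \frac{\sigma_S(Y')}{\sigma_S(X')}\,[x - \mu_S(X')], \tag{11}$$

where 

$$\mu_S(X') = \int \mu_P(X' \mid a')\, dH_S(a'), \tag{12}$$

$$\mu_S(Y') = \int \mu_Q(Y' \mid a')\, dH_S(a'), \tag{13}$$

$$\sigma_S^2(X') = \int \mathrm{var}_P(X' \mid a')\, dH_S(a') + \int \mu_P^2(X' \mid a')\, dH_S(a') - \mu_S^2(X'), \tag{14}$$

and 

$$\sigma_S^2(Y') = \int \mathrm{var}_Q(Y' \mid a')\, dH_S(a') + \int \mu_Q^2(Y' \mid a')\, dH_S(a') - \mu_S^2(Y'). \tag{15}$$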


Then set the term on the left side of each of these equations (12, 13, 14, and 15) equal to 
the corresponding term in the Levine observed score equating from X to Y on S. One needs to 
adjust the marginal distributions of both X and Y to solve Equations 12, 13, 14, and 15, although 
in general the values of the $\lambda$s are close to 1. The formulas for the $\lambda$s and $\nu$s are given in 
Appendix B. 
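With a discrete anchor, the synthetic-population moments in Equations 12 through 15 are sums over anchor strata: the mean is the weighted average of the conditional means, and the variance is the average within-stratum variance plus the variance of the conditional means. A minimal sketch follows, with hypothetical names and numbers.

```python
def synthetic_mean_and_variance(cond_mean, cond_var, h_s):
    """Combine conditional moments over anchor strata weighted by h_s.

    cond_mean[a], cond_var[a] : mean and variance of the test score given A' = a
    h_s[a]                    : probability of stratum a in population S
    """
    mean_s = sum(cond_mean[a] * h_s[a] for a in h_s)            # Eq. 12 or 13
    mean_of_sq = sum(cond_mean[a] ** 2 * h_s[a] for a in h_s)
    within = sum(cond_var[a] * h_s[a] for a in h_s)
    var_s = within + mean_of_sq - mean_s ** 2                   # Eq. 14 or 15
    return mean_s, var_s


# Hypothetical two-stratum example.
mu_s, var_s_x = synthetic_mean_and_variance(
    cond_mean={0: 10.0, 1: 14.0},
    cond_var={0: 4.0, 1: 4.0},
    h_s={0: 0.5, 1: 0.5})
print(mu_s, var_s_x)
```

In this example the within-stratum variance contributes 4 and the spread of the conditional means contributes another 4, so the synthetic variance is 8.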


Discussion 

The idea for PSEE_TA came from observing the similarity between two equatings 
computed from the same data: a direct linear equating of distributions estimated by item response 
theory (IRT) and Levine observed score equating based on the same IRT-fitted data. (For the 
results of this comparison, see Chen, 2010.) It was then apparent that PSEE_TA provides a more 
natural definition for curvilinear Levine observed score equating than the method previously 
given that name (Chen & Holland, 2009), which only provides a computational procedure for 
approximating PSEE_TA. 

Several issues related to the results in this paper suggest directions for further research. 

First, it is problematic that the mean-preserving linear transformation is applied to the 
bivariate distribution of two observed score variables to simulate the bivariate distribution of an 
observed score variable and a true score variable. The process requires the computation of 
$\sigma_{T_X}$, $\sigma_{T_A}$, and so on, which in turn assumes a perfect correlation between the true scores 
$T_X$ and $T_A$ and some additional conditions on (X, A). These assumptions are often violated by 
nonlinearity between X and A. Hence, ratios involving these standard deviations can be off target 
by 1% to 5%, which sometimes can cause a difference in the equating results as large as 0.15 or 
even 0.2 standard deviations. This violation of the assumption of a perfect true-score correlation 
between test and anchor scores explains why the Levine method is often not as accurate as other 
linear methods. Even if by chance $\sigma_{T_X}$, $\sigma_{T_A}$, and so on are estimated perfectly, the true score 
distribution estimated by applying the mean-preserving linear transformation procedure on the 
observed score distribution can be quite different from its actual distribution. It can be shown 
that when an IRT model is used to estimate the joint distributions of test scores and anchor 
scores, PSEE_TA is quite different from PSEE with the mean-preserving linear transformation 
technique, particularly at the end-score points. Therefore, to make a reliable PSEE_TA procedure, 
it is necessary to develop a model that can produce the joint distribution for two variables 
(observed scores on the test and true scores on the anchor) and that can fit the data well for the 
majority of data sets. Some progress has been made in this direction. 

The second issue is more critical. In equating through an anchor, without IRT, the choice 
of an equating method involves a judgment as to the extent to which the assumptions of each 
method are likely to be satisfied by the data. Based on the data, one can choose the Tucker 
method, the Levine method, the chained linear method, their curvilinear versions, or other 
methods. On the other hand, IRT equating, with any IRT model, is actually poststratification 
equipercentile equating, stratifying on true anchor scores (Chen, 2010), which we now know is 
essentially the curvilinear form of the Levine method. Therefore, IRT equating, like Levine 
observed-score equating, will tend to make too large an adjustment for the ability difference 
between populations P and Q. Should one adopt the IRT viewpoint so that only Levine-type 
equating methods can be used? Or should one try to choose an equating method on a 
case-by-case basis, and if so, by what criteria? Certainly, many more theoretical and technical 
questions are waiting to be answered. 

Conclusion 


In this paper, poststratification equipercentile equating, stratifying on true anchor scores 
(PSEE_TA), is defined, and its linear form is shown to be the Levine observed score equating 
method. Hence, this method, rather than the method defined in Chen and Holland (2009), is the 
method that should be called the curvilinear Levine observed score equating method. 

Based on the results in Chen, Livingston, and Holland (2010), most, if not all, of the 
commonly used equating methods for NEAT designs can be approximated as either (a) observed 
anchor score-based poststratification equating, which includes PSEE, Tucker, and Braun-Holland 
(Braun & Holland, 1982); or (b) true anchor score-based poststratification equating, which 
includes both Levine methods, hybrid Levine equipercentile equating, chained true score 
equipercentile equatings, curvilinear Levine observed score equating, and IRT-based equating; or 
(c) partially true anchor score-based poststratification equating, which includes chained 
equipercentile equating, chained linear equating, and modified poststratification equating (Wang 
& Brennan, 2007). Theoretically, the definition of modified poststratification equating is the 
same as that of PSEE_TA. Computationally, both methods use the mean-preserving linear 
transformation to modify the existing distributions, but modified poststratification equating 
estimates only the marginal distributions of the true anchor scores, while PSEE_TA estimates the 
joint distribution of observed test scores and true anchor scores. 

Two important problems remain to be solved. The first problem is to develop a model for 
estimating a joint distribution of observed test scores and true anchor scores, one that can fit the 
data better than currently used latent variable models. The second problem is to develop a 
criterion (or a set of criteria) for determining which method (PSEE, PSEE_TA, or some other 
equating method) is most appropriate for a given equating task. 

Notes 


1 See Angoff, 1982; Braun & Holland, 1982. 

Appendix A 

Proof of the Main Theorem 


One needs to show that Assumptions 1 and 2 of generalized Levine observed score equating 
are satisfied for both X and Y defined from PSEE_TA under the conditions of the theorem. 

Let $\mu_P(X \mid \tau_a)$ defined in Equation 9 be a linear function of $\tau_a$, that is, 

$$\mu_P(X \mid \tau_a) = \alpha \tau_a + \beta, \tag{A1}$$

where both $\alpha$ and $\beta$ will be determined later. Taking the mean of $\mu_P(X \mid \tau_a)$ over $T_A$ in 
population P, using both Equations 9 and A1, one has 

$$\mu_P(X) = \alpha\, \mu_P(T_A) + \beta. \tag{A2}$$

Then multiplying $\mu_P(X \mid \tau_a)$ by $\tau_a$ and taking the mean over $T_A$ in population P, using both 
Equations 9 and A1 again, gives 

$$\mu_P(X T_A) = \alpha\, \mu_P(T_A^2) + \beta\, \mu_P(T_A). \tag{A3}$$

Equations A2 and A3 are the same equations that are solved to obtain the regression of X on $T_A$ 
in population P. Hence, if the regression of X on $T_A$ in population Q has the same $\alpha$ and $\beta$, it can 
be said that the regression is population-invariant. 

By assuming that the conditional distribution of X given $T_A = \tau_a$ is population-invariant, 
that is, $F_Q(x \mid \tau_a) = F_P(x \mid \tau_a)$, one can define $\mu_Q(X \mid \tau_a)$ in the form of Equation 9 as 

$$\mu_Q(X \mid \tau_a) = \int x\, dF_Q(x \mid \tau_a) = \int x\, dF_P(x \mid \tau_a) = \alpha \tau_a + \beta. \tag{A4}$$

Hence, the regression of X on $T_A$ is population-invariant. This proves that Assumption 1 of 
generalized Levine observed score equating is valid for X. 

From Equations A2 and A3, one gets 

$$\alpha = \frac{\mathrm{cov}_P(X, T_A)}{\mathrm{var}_P(T_A)} \tag{A5}$$

and 

$$\beta = \mu_P(X) - \frac{\mathrm{cov}_P(X, T_A)}{\mathrm{var}_P(T_A)}\, \mu_P(T_A). \tag{A6}$$
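The step from Equations A2 and A3 to the slope and intercept is the usual solution of two linear equations in two unknowns. The sketch below carries it out with exact rational arithmetic; the function name, argument names, and the toy moments are assumptions made for illustration.

```python
from fractions import Fraction


def regression_from_moments(mean_x, mean_t, mean_xt, mean_tt):
    """Solve Equations A2 and A3 for (alpha, beta).

    mean_x  = E[X]          mean_t  = E[T_A]
    mean_xt = E[X * T_A]    mean_tt = E[T_A ** 2]
    """
    var_t = mean_tt - mean_t ** 2
    if var_t == 0:
        raise ValueError("T_A must have positive variance")
    alpha = (mean_xt - mean_x * mean_t) / var_t    # cov(X, T_A) / var(T_A)
    beta = mean_x - alpha * mean_t                 # from Equation A2
    return alpha, beta


# Toy check: T_A uniform on {0, 1, 2} and E[X | T_A] = 3 * T_A + 1, so that
# E[T_A] = 1, E[T_A^2] = 5/3, E[X] = 4, and E[X * T_A] = 6.
alpha, beta = regression_from_moments(
    Fraction(4), Fraction(1), Fraction(6), Fraction(5, 3))
print(alpha, beta)
```

The recovered values match the slope and intercept used to build the toy moments, which is what Equations A2 and A3 require.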


With the assumption in the theorem that $\mathrm{var}_P(X \mid \tau_a) = c$ is a constant on $T_A$, the integration 
of $\mathrm{var}_P(X \mid \tau_a)$ over $T_A$ in population P gives 

$$\begin{aligned}
c &= \mu_P[\mathrm{var}_P(X \mid T_A)] \\
&= \mathrm{var}_P(X) - \mathrm{var}_P[\mu_P(X \mid T_A)] \quad \text{(law of total variance)} \\
&= \mathrm{var}_P(X) - \mathrm{var}_P(\alpha T_A + \beta) \\
&= \mathrm{var}_P(X) - \alpha^2\, \mathrm{var}_P(T_A) \\
&= \mathrm{var}_P(X) - \frac{\mathrm{cov}_P^2(X, T_A)}{\mathrm{var}_P(T_A)} \\
&= \mathrm{var}_P(X)\,[1 - \rho_P^2(X, T_A)].
\end{aligned} \tag{A7}$$

Here the law of total variance is used (see Rao, 1973, p. 97, Equation (2b.3.6); see also 
Equations A1 and A5 in this paper). 

Similar to Equation 10, $\mathrm{var}_Q(X \mid \tau_a)$ can be defined as 

$$\mathrm{var}_Q(X \mid \tau_a) = \int [x - \mu_Q(X \mid \tau_a)]^2\, dF_Q(x \mid \tau_a). \tag{A8}$$

With the assumption that $F_Q(x \mid \tau_a) = F_P(x \mid \tau_a)$ and the result that $\mu_Q(X \mid \tau_a) = \mu_P(X \mid \tau_a)$, one 
can see that $\mathrm{var}_Q(X \mid \tau_a) = c$ is a constant on $T_A$ as well. Integrating $\mathrm{var}_Q(X \mid \tau_a)$ over $T_A$ in 
population Q gives 

$$c = \mathrm{var}_Q(X)\,[1 - \rho_Q^2(X, T_A)]. \tag{A9}$$

This proves that Assumption 2 of generalized Levine observed score equating is valid for X 
also. Similarly, both assumptions can be shown to be valid for Y. 

Appendix B 

Formulas for Parameter Values in Equations 12-15 

Given (X, A) and (Y, A) in a NEAT design, let 

$$X' = \mu_X + \lambda_1 (X - \mu_X) \quad \text{and} \quad A' = \mu_{A_P} + \nu_1 (A - \mu_{A_P}) \quad \text{on } P, \tag{B1}$$

and 

$$Y' = \mu_Y + \lambda_2 (Y - \mu_Y) \quad \text{and} \quad A' = \mu_{A_Q} + \nu_2 (A - \mu_{A_Q}) \quad \text{on } Q, \tag{B2}$$

where the subscripts on $\mu_A$ indicate which population is used. The proper values for producing 
the so-called distributions based on true anchor scores are: 



Py 

where p x is a y !o x , and so forth. 

X 

In general, $\lambda_1$ and $\lambda_2$ are close to 1 (Chen & Holland, 2009). 